1. Introduction


The main purpose of many social surveys is the analysis of
relationships between variables. In particular, a principal goal of
studies of public reactions to aircraft noise is generally the estimation
of regression parameters for a model predicting annoyance as a function of
various measures of noise exposure. For example, in studying the trade-off
between noise levels and numbers of events, the following two-variable
regression model is commonly employed:

(1.1)  $y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + e_i$

where $y_i$ is a measure of annoyance associated with the ith individual in the
population, $x_{1i}$ and $x_{2i}$ are measures of the noise level and number of
events, respectively, to which the ith individual is exposed, and the $e_i$ are
independent random disturbance terms.

For simple random sample designs, the model given in (1.1) is quite
adequate. Discussion of design and analysis issues for such models appears
in many standard texts on multiple regression; see, for example, Draper and
Smith (1981) or Neter, Wasserman and Kutner (1985).

More commonly, samples for social surveys are drawn using complex 
sampling schemes, usually stratified, multi-stage cluster sample designs. 
For example, studies of residents' responses to noise most often consist of 
interviews with samples of individuals drawn from a number of different 
compact study areas, usually neighborhoods. In order to design such studies, 
it is necessary to determine the numbers of individuals and numbers of 
study areas to include in order to achieve specified research objectives. 

The statistical techniques developed in this report provide a basis for these 
sample design decisions. 

Optimal design and estimation for means and totals using these
designs are well understood (Cochran, 1963). On the other hand, no such
consensus exists for design and estimation for regression parameters using
such samples. One approach to this problem is described by Kalton (1983) in
an earlier NASA Contractor Report. His methodology is briefly described in
Section 2, below. Kalton employed regression models which incorporate
nested random intercepts associated with various stages of a multi-stage
cluster sample design. For cases where there is little or no variability in
predictor variables within clusters, this approach provides useful results.
However, when such variability does exist, it can lead to results which
seem counter-intuitive.

In Section 3, we build on the models proposed by Kalton. Three 
regression models are presented for which the regression parameters 
themselves are considered random, with components of variability 
corresponding to the stages of a multi-stage cluster sample design. These 
models differ in the assumptions regarding variability of model parameters. 
For each, sampling variances and covariances are derived for estimates of 
linear regression parameters and ratios of parameters. Optimal allocation 
of sample resources across the stages of the design is derived for each 
situation. These allocations depend, in part, on estimates of variability at 
the various levels of the sample designs. Variance component estimates for 
this purpose are derived. These estimates could be obtained from existing 
data, contributing to the efficient design of future surveys. In Section 4, 
we apply some of these techniques in a simple example. Finally, the results 
of this research are summarized in Section 5. 

2. Models with Fixed Regression Slopes. 

The task of designing complex sample surveys for estimating 
regression parameters has been addressed elsewhere. Specifically, Kalton 
(1983) considered essentially the same problem in an earlier NASA 
Contractor Report. The simplest sampling situation considered by Kalton is 
that of a two-stage design. In the first stage, primary sampling units 
(PSU's) are selected. In the second, individuals are sampled within the 
selected PSU's. For example, for a survey around a single airport, PSU’s 
might correspond to Census Tracts, and individuals within these PSU's might 
correspond to households within Census Tracts. The first multiple 
regression model for such a design considered by Kalton is a classical 
nested random effects model: 

(2.1)  $y_{ij} = B_{0i} + \beta_1 x_{1ij} + \beta_2 x_{2ij} + e_{ij}$

where $B_{0i}$ and $e_{ij}$ are random effects corresponding to PSU's and
individuals within PSU's, respectively. (See equation (11) in Kalton.)

Note that the slope parameters in equation (2.1), $\beta_1$ and $\beta_2$, are
constant. They do not vary from cluster to cluster. While this assumption
is standard for such random effects analysis of covariance models, the
design implications are somewhat counter-intuitive. As was noted by
Kalton, assuming a standard cost model, if the within-cluster variability of
the x-variables is the same as that of the associated population variability,
a sample of a single cluster can be the implied optimal design for
estimating slope parameters or functions of slope parameters. This seems
counter-intuitive because one suspects that the structure of the
relationship (the slope parameters) is probably not constant across
clusters. Under such circumstances, drawing a sample of a single cluster
would not be considered a safe alternative. In the following sections, we
consider models which incorporate this structural variability.

3. Random Coefficients Regression Models. 

The principal difference between the models considered in this 
section and those employed by Kalton is that here, the slopes are allowed to 
vary from cluster to cluster. They are assumed to be random, with 
components of variability associated with the various stages in a multi- 
stage cluster sample design. Below, three model-design combinations are 
presented. In the first two of these, estimates of regression slopes and of 
variance components are functions of individual slope estimates associated 
with observations belonging to the same penultimate stage sampling units. 

Model I is for a two-stage design as described in Section 2 above. For 
such designs, penultimate stage sampling units are PSU's. Model II
incorporates structural variability at two levels for a three-stage design.
Such a sampling scheme might be used for a national or regional survey with 
cities or counties serving as PSU's, Census Tracts selected within PSU's as 
secondary sampling units (SSU's), and finally households within SSU's. 

For some samples, an estimation strategy based on regression
parameter estimates obtained within penultimate stage sampling units is
feasible; for many others, however, it is not. For example, in order to obtain
slope estimates within a cluster, there must be variability in the predictor 
variables in that cluster. In the present case, that means variability in the 
noise exposure measurements within penultimate clusters. For many 
available studies, there is little or no variability in these predictors at the 
penultimate cluster level and the procedures described for Models I and II 
are not feasible. 

Model III is also based on a three-stage design. It contains random 
effects which incorporate variability in regression slope parameters only at 
the first stage. Estimates depend on individual slope estimates associated 
with data from PSU's. Such a procedure could be applied to data from most 
available studies. It should be noted, however, that should substantial
variability in regression slopes exist at the SSU level, conclusions obtained
from analyses based on Model III should be limited to designs with similar
final stage characteristics. That is, in designing future studies, only
designs which have PSU's of size comparable to those associated with PSU's 
in the studies used to obtain estimates of variance components should be 
considered. Any variability in slope parameters associated with SSU's in 
these studies will be included in the estimate of the variance component 
associated with PSU's. However, the effect on the estimated component 
will change depending on the number of SSU's per PSU. 

3.1 Model I: Two-stage design, parameters random at the 
first stage. 

3.1.1 The model and the design. 

First, we consider a two-stage sample design, together with a model 
which incorporates variability in regression parameters at the first stage: 

Design: Select n PSU's, and within the ith PSU, select $n_i$ individuals
for a total sample size of $\sum_i n_i = n_\cdot$.


(3.1)  $y_{ij} = B_{0i} + B_{1i} x_{1ij} + B_{2i} x_{2ij} + e_{ij}$ , where
       $B_{mi} = \beta_m + a_{mi}$ ; $m = 0,1,2$ , and where

       $E(a_{mi}) = 0$ ; $m = 0,1,2$

       $E(a_{mi} a_{m'i}) = \sigma_{amm'}$ ; $m,m' = 0,1,2$

       $E(e_{ij}) = 0$ ; $j = 1,2,\ldots,n_i$ ; $i = 1,2,\ldots,n$
       $E(e_{ij}^2) = \sigma_e^2$ ; $j = 1,2,\ldots,n_i$ ; $i = 1,2,\ldots,n$
       $E(e_{ij} e_{i'j'}) = 0$ ; $i \neq i'$ or $j \neq j'$

       $E(e_{ij} a_{mi'}) = 0$ ; $m = 0,1,2$ ; $j = 1,\ldots,n_i$ ; $i,i' = 1,\ldots,n$

Here, for simplicity, the independent variables $x_{1ij}$ and $x_{2ij}$ are
corrected for within-PSU means. That is,

(3.2)  $x_{mij} = x'_{mij} - \bar{x}'_{mi\cdot}$ ; $m = 1,2$ ; $j = 1,\ldots,n_i$ ; $i = 1,\ldots,n$

where $x'_{mij}$ is the raw, uncorrected measurement, and where
$\bar{x}'_{mi\cdot} = \sum_j x'_{mij}/n_i$ is the mean of the uncorrected
measurements for the ith PSU.

3.1.2 Estimating regression parameters, associated 
sampling variances, and optimal design. 

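Let $b_{mi}$ be the usual least squares estimate for $B_{mi}$, $m = 0,1,2$,
calculated within PSU's. That is,

(3.3)  $b_{0i} = \sum_j y_{ij}/n_i = \bar{y}_{i\cdot}$ ,

       $b_{1i} = \dfrac{(\sum_j x_{2ij}^2)(\sum_j x_{1ij} y_{ij}) - (\sum_j x_{1ij} x_{2ij})(\sum_j x_{2ij} y_{ij})}{(\sum_j x_{1ij}^2)(\sum_j x_{2ij}^2) - (\sum_j x_{1ij} x_{2ij})^2}$ , and

       $b_{2i} = \dfrac{(\sum_j x_{1ij}^2)(\sum_j x_{2ij} y_{ij}) - (\sum_j x_{1ij} x_{2ij})(\sum_j x_{1ij} y_{ij})}{(\sum_j x_{1ij}^2)(\sum_j x_{2ij}^2) - (\sum_j x_{1ij} x_{2ij})^2}$ .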

Estimates of the $\beta_m$, $m = 0,1,2$, are obtained by averaging the PSU
estimates, i.e.

(3.4)  $\hat\beta_m = \sum_i b_{mi}/n$ ; $m = 0,1,2$.

Then, since the within-PSU least squares estimates, the $b_{mi}$, are unbiased
for the associated $B_{mi}$, the overall estimates, $\hat\beta_m$, are unbiased
for the parameters $\beta_m$. That is,

(3.5)  $E(\hat\beta_m) = E_i [E_{j|i} (\sum_i b_{mi}/n)]$
                       $= E_i [\sum_i B_{mi}/n]$
                       $= \beta_m$ .

Here, the notation $E_{j|i}$ represents the conditional expectation taken over all
samples of individuals (the j's) for a fixed set of PSU's (the i's).
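
As an illustration only, the estimation steps in (3.3) and (3.4) can be sketched in Python. The function names `within_psu_fit` and `average_estimates` are hypothetical and do not come from this report. The sketch assumes that each PSU supplies raw (uncorrected) predictor values and that the two predictors are not collinear within the PSU.

```python
def within_psu_fit(x1_raw, x2_raw, y):
    """Within-PSU least squares estimates (b0, b1, b2) as in (3.3).

    Predictors are first corrected for their within-PSU means, as in (3.2),
    so that the intercept estimate is simply the PSU mean of y.
    """
    n_i = len(y)
    mean1 = sum(x1_raw) / n_i
    mean2 = sum(x2_raw) / n_i
    x1 = [v - mean1 for v in x1_raw]
    x2 = [v - mean2 for v in x2_raw]

    s11 = sum(a * a for a in x1)
    s22 = sum(a * a for a in x2)
    s12 = sum(a * b for a, b in zip(x1, x2))
    s1y = sum(a * b for a, b in zip(x1, y))
    s2y = sum(a * b for a, b in zip(x2, y))

    det = s11 * s22 - s12 ** 2
    if det <= 0:
        # No usable variability in the predictors within this PSU.
        raise ValueError("predictors are constant or collinear within the PSU")

    b0 = sum(y) / n_i
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return b0, b1, b2


def average_estimates(psu_fits):
    """Overall estimates as in (3.4): unweighted means of the PSU estimates."""
    n = len(psu_fits)
    return tuple(sum(column) / n for column in zip(*psu_fits))
```

A PSU for which `within_psu_fit` raises an error corresponds to the situation discussed in Section 3, where there is too little variability in the noise exposure measurements to estimate slopes within the cluster.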
The sampling variances of these estimates are determined as follows,

(3.6)  $\mathrm{Var}(\hat\beta_0) = E_i [\mathrm{Var}_{j|i} (\hat\beta_0)] + \mathrm{Var}_i [E_{j|i} (\hat\beta_0)]$

       $= E_i [\sum_i (\sigma_e^2/n_i)/n^2] + \mathrm{Var}_i (\sum_i B_{0i}/n)$

       $= \{\sigma_e^2 [\sum_i (1/n_i)] / n^2\} + (\sigma_{a00}/n)$ .

The conditional variance notation, $\mathrm{Var}_{j|i}$, is defined in a manner
analogous to that of the conditional expectation described above for
equation (3.5). If there is a constant PSU size,
$n_i = \sum_i n_i/n = \bar{n}_\cdot$ , then (3.6) simplifies as follows,

(3.7)  $\mathrm{Var}(\hat\beta_0) = (\sigma_e^2/n\bar{n}_\cdot) + (\sigma_{a00}/n)$

       $= (\sigma_e^2/n_\cdot) + (\sigma_{a00}/n)$ ,

where $n_\cdot = \sum_i n_i$ is the total sample size.

For $\hat\beta_1$, the sampling variance is developed as follows,

(3.8)  $\mathrm{Var}(\hat\beta_1) = E_i [\mathrm{Var}_{j|i} (\hat\beta_1)] + \mathrm{Var}_i [E_{j|i} (\hat\beta_1)]$

       $= E_i [(\sigma_e^2/n^2) \sum_i \sigma_{i22}/n_i(\sigma_{i11}\sigma_{i22} - \sigma_{i12}^2)] + \mathrm{Var}_i (\sum_i B_{1i}/n)$

       $= [(\sigma_e^2/n^2) \sum_i A_{i22}/n_i] + \sigma_{a11}/n$ , where

       $A_{i22} = \sigma_{i22}/(\sigma_{i11}\sigma_{i22} - \sigma_{i12}^2)$ , where
       $\sigma_{i11}$ = variance of $x_{1ij}$ within the ith PSU,
       $\sigma_{i22}$ = variance of $x_{2ij}$ within the ith PSU, and
       $\sigma_{i12}$ = covariance of $x_{1ij}$ and $x_{2ij}$ within the ith PSU.

Again, if $n_i = \bar{n}_\cdot$ , this simplifies to

(3.9)  $\mathrm{Var}(\hat\beta_1) = \sigma_e^2 A_{22}/n_\cdot + \sigma_{a11}/n$ , where

       $A_{22} = \sum_i A_{i22}/n$ .


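Similarly, for $\hat\beta_2$,

(3.10) $\mathrm{Var}(\hat\beta_2) = [(\sigma_e^2/n^2) \sum_i A_{i11}/n_i] + \sigma_{a22}/n$ , where

       $A_{i11} = \sigma_{i11}/(\sigma_{i11}\sigma_{i22} - \sigma_{i12}^2)$ , and if $n_i = \bar{n}_\cdot$ ,

       $\mathrm{Var}(\hat\beta_2) = \sigma_e^2 A_{11}/n_\cdot + \sigma_{a22}/n$ , where
       $A_{11} = \sum_i A_{i11}/n$ .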

Under the simple cost model assumed by Kalton, $C = C_0 + nC_a + n_\cdot C_b$ ,
where $C_0$ is the fixed cost of the survey, $C_a$ is the average cost of including
a cluster in the sample, and $C_b$ is the average cost of including an individual
in the sample, the optimum cluster size for estimating $\beta_1$ is given by,

(3.11) $\bar{n}_\cdot(\mathrm{opt}) = \{\sigma_e^2 A_{22}/\sigma_{a11}\}^{1/2} \{C_a/C_b\}^{1/2}$
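
One way to see where (3.11) comes from is sketched below. This is a reconstruction under the constant PSU size assumption of (3.9), not a derivation quoted from Kalton. Here $\bar n$ is shorthand for the constant number of individuals per PSU, so that the total sample size is $n\bar n$. The product of variance and variable cost does not depend on n, so it can be minimized over $\bar n$ alone.

```latex
\begin{align*}
C - C_0 &= n\,(C_a + \bar n\, C_b), \qquad
\mathrm{Var}(\hat\beta_1) = \frac{1}{n}\left(\frac{\sigma_e^2 A_{22}}{\bar n} + \sigma_{a11}\right), \\
\mathrm{Var}(\hat\beta_1)\,(C - C_0) &=
  \left(\frac{\sigma_e^2 A_{22}}{\bar n} + \sigma_{a11}\right)(C_a + \bar n\, C_b), \\
0 &= -\frac{\sigma_e^2 A_{22}\, C_a}{\bar n^{2}} + \sigma_{a11}\, C_b
\quad\Longrightarrow\quad
\bar n(\mathrm{opt}) = \left\{\frac{\sigma_e^2 A_{22}}{\sigma_{a11}}\right\}^{1/2}
                       \left\{\frac{C_a}{C_b}\right\}^{1/2}.
\end{align*}
```

The same argument with $A_{11}$ and $\sigma_{a22}$ in place of $A_{22}$ and $\sigma_{a11}$ gives the corresponding optimum for $\beta_2$.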

3.1.3 Estimating ratios of regression parameters, 
associated sampling variances and optimal design. 

Finally, for designing a sample to estimate ratios of regression
coefficients, one requires the sampling covariance of the estimates
$\hat\beta_1$ and $\hat\beta_2$,

(3.12) $\mathrm{Cov}(\hat\beta_1,\hat\beta_2) = [(-\sigma_e^2/n^2) \sum_i A_{i12}/n_i] + \sigma_{a12}/n$ , where
       $A_{i12} = \sigma_{i12}/(\sigma_{i11}\sigma_{i22} - \sigma_{i12}^2)$ , and if $n_i = \bar{n}_\cdot$ ,

       $\mathrm{Cov}(\hat\beta_1,\hat\beta_2) = -\sigma_e^2 A_{12}/n_\cdot + \sigma_{a12}/n$ , where

       $A_{12} = \sum_i A_{i12}/n$ .

For estimating the ratio of the two regression coefficients,
$R = \beta_2/\beta_1$, we propose to use $\hat{R} = \hat\beta_2/\hat\beta_1$. Then, we
can use the Taylor expansion method to obtain an approximation of the
variance of this ratio estimate,

(3.13) $\mathrm{Var}(\hat{R}) \approx \beta_1^{-2} [\mathrm{Var}(\hat\beta_2) + R^2\, \mathrm{Var}(\hat\beta_1) - 2R\, \mathrm{Cov}(\hat\beta_1,\hat\beta_2)]$ .
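
The approximation in (3.13) is simple to evaluate numerically. The following Python sketch is offered only as an illustration. The function name `ratio_and_variance` is hypothetical, and the sketch makes the additional assumption that the unknown $\beta_1$ and R in (3.13) are replaced by their estimates.

```python
def ratio_and_variance(beta1, beta2, var1, var2, cov12):
    """Ratio estimate beta2/beta1 and its Taylor-series variance, as in (3.13).

    beta1, beta2 : estimated slopes
    var1, var2   : sampling variances of the two slope estimates
    cov12        : sampling covariance of the two slope estimates
    """
    if beta1 == 0:
        raise ValueError("the denominator slope must be non-zero")
    r = beta2 / beta1
    var_r = (var2 + r * r * var1 - 2.0 * r * cov12) / beta1 ** 2
    return r, var_r
```

Because the variance is scaled by the inverse square of the denominator slope, the approximation becomes unreliable when that slope is small relative to its standard error.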

Again, if there is a constant cluster size of $n_i = \bar{n}_\cdot$ , this simplifies to

(3.14) $\mathrm{Var}(\hat{R}) \approx \beta_1^{-2} [(\sigma_e^2 A_{11}/n_\cdot + \sigma_{a22}/n)$

       $+ R^2 (\sigma_e^2 A_{22}/n_\cdot + \sigma_{a11}/n)$

       $- 2R (-\sigma_e^2 A_{12}/n_\cdot + \sigma_{a12}/n)]$

       $= \beta_1^{-2} \{[\sigma_e^2 (A_{11} + R^2 A_{22} + 2R A_{12}) / n_\cdot]$

       $+ [(\sigma_{a22} + R^2 \sigma_{a11} - 2R \sigma_{a12}) / n]\}$ .

Under the cost model described above, the optimum PSU size for estimating
the ratio is given by,

(3.15) $\bar{n}_\cdot(\mathrm{opt}) = \{\sigma_e^2 (A_{11} + R^2 A_{22} + 2R A_{12}) / (\sigma_{a22} + R^2 \sigma_{a11} - 2R \sigma_{a12})\}^{1/2} \{C_a/C_b\}^{1/2}$ .

3.1.4 Estimating variance components. 

Thus, in order to determine the optimum PSU size for estimating
regression coefficients and ratios of regression coefficients, one requires
the following information: the average cost parameters, $C_a$ and $C_b$, the
design characteristics, the variance and covariance components of the
random parameters, and an approximation for the true ratio R. The design
characteristics describe the within-PSU distribution of the x variables in
terms of $A_{11}$, $A_{22}$, and $A_{12}$. These in turn depend on $\sigma_{i11}$,
$\sigma_{i22}$, and $\sigma_{i12}$ calculated within clusters. The variance
components of the random parameters, $\sigma_{a11}$, $\sigma_{a22}$, $\sigma_{a12}$,
and $\sigma_e^2$ can be estimated from previous surveys using the methods
described below.


The residual variance, $\sigma_e^2$, is estimated in the usual manner as,

(3.16) $\hat\sigma_e^2 = \dfrac{\sum_i\sum_j (y_{ij} - \hat{y}_{ij})^2}{\sum_i (n_i - 3)} = \dfrac{\sum_i\sum_j (y_{ij} - \hat{y}_{ij})^2}{n_\cdot - 3n}$

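The remaining components are estimated as follows:

(3.17) $\hat\sigma_{a11} = \sum_i (b_{1i} - \hat\beta_1)^2/(n-1) - \hat\sigma_e^2 (\sum_i A_{i22}/n_i)/n$

       or for constant cluster size, $n_i = \bar{n}_\cdot$ ,
       $\hat\sigma_{a11} = \sum_i (b_{1i} - \hat\beta_1)^2/(n-1) - \hat\sigma_e^2 A_{22}/\bar{n}_\cdot$ .

(3.18) $\hat\sigma_{a22} = \sum_i (b_{2i} - \hat\beta_2)^2/(n-1) - \hat\sigma_e^2 (\sum_i A_{i11}/n_i)/n$

       or for constant cluster size, $n_i = \bar{n}_\cdot$ ,
       $\hat\sigma_{a22} = \sum_i (b_{2i} - \hat\beta_2)^2/(n-1) - \hat\sigma_e^2 A_{11}/\bar{n}_\cdot$ .

(3.19) $\hat\sigma_{a12} = \sum_i (b_{1i} - \hat\beta_1)(b_{2i} - \hat\beta_2)/(n-1) + \hat\sigma_e^2 (\sum_i A_{i12}/n_i)/n$

       or for constant cluster size, $n_i = \bar{n}_\cdot$ ,
       $\hat\sigma_{a12} = \sum_i (b_{1i} - \hat\beta_1)(b_{2i} - \hat\beta_2)/(n-1) + \hat\sigma_e^2 A_{12}/\bar{n}_\cdot$ .

The sign of the correction term in (3.19) follows from (3.12), in which the
residual variance enters the covariance of the two slope estimates with a
negative sign.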
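
The estimates in (3.17) to (3.19) can be computed directly from the within-PSU slope estimates. The Python sketch below is illustrative only. The function name `model1_variance_components` and its argument layout are assumptions made for this example, and the residual variance estimate of (3.16) is taken as given.

```python
def model1_variance_components(b1, b2, n_i, a11, a22, a12, sigma_e2):
    """Moment estimates of sigma_a11, sigma_a22 and sigma_a12 for Model I.

    b1, b2        : lists of within-PSU slope estimates b_1i and b_2i
    n_i           : list of PSU sample sizes
    a11, a22, a12 : lists of the design measures A_i11, A_i22, A_i12
    sigma_e2      : estimate of the residual variance
    """
    n = len(b1)
    beta1 = sum(b1) / n
    beta2 = sum(b2) / n

    # Between-PSU sample variances and covariance of the slope estimates.
    s11 = sum((b - beta1) ** 2 for b in b1) / (n - 1)
    s22 = sum((b - beta2) ** 2 for b in b2) / (n - 1)
    s12 = sum((u - beta1) * (v - beta2) for u, v in zip(b1, b2)) / (n - 1)

    # Average within-PSU sampling contributions, e.g. sigma_e^2 * A_i22 / n_i.
    within11 = sigma_e2 * sum(a / m for a, m in zip(a22, n_i)) / n
    within22 = sigma_e2 * sum(a / m for a, m in zip(a11, n_i)) / n
    within12 = sigma_e2 * sum(a / m for a, m in zip(a12, n_i)) / n

    return {
        "sigma_a11": s11 - within11,   # (3.17)
        "sigma_a22": s22 - within22,   # (3.18)
        "sigma_a12": s12 + within12,   # (3.19)
    }
```

Since these are differences of estimated quantities, the variance estimates can turn out negative in small samples. How to treat such cases is not addressed in this report.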

3.2 Model II: Three-Stage Design, Parameters Random at 
the First Two Stages. 

3.2.1 The model and the design. 

Now, we consider a three-stage sample design, together with a model 
which incorporates variability in regression parameters at the first two 
stages: 

Design: Select n PSU's; within the ith PSU, select $n_i$ SSU's; and
within the ijth SSU, select $n_{ij}$ individuals.

(3.20) $y_{ijk} = B_{0ij} + B_{1ij} x_{1ijk} + B_{2ij} x_{2ijk} + e_{ijk}$ , where

       $B_{mij} = \beta_m + a_{mi} + c_{mij}$ ; $m = 0,1,2$ ; $j = 1,\ldots,n_i$ ; $i = 1,\ldots,n$ .

       $E(e_{ijk}) = E(a_{mi}) = E(c_{mij}) = 0$ ; $m = 0,1,2$ ;
       $k = 1,\ldots,n_{ij}$ ; $j = 1,\ldots,n_i$ ; $i = 1,\ldots,n$ .

       $E(a_{mi} a_{m'i}) = \sigma_{amm'}$ ; $m,m' = 0,1,2$ ; $i = 1,\ldots,n$ .

       $E(c_{mij} c_{m'ij}) = \sigma_{cmm'}$ ; $m,m' = 0,1,2$ ; $j = 1,\ldots,n_i$ ; $i = 1,\ldots,n$ .

       $E(a_{mi} a_{m'i'}) = 0$ ; $m,m' = 0,1,2$ ; $i \neq i'$ .

       $E(c_{mij} c_{m'i'j'}) = 0$ ; $m,m' = 0,1,2$ ; $i \neq i'$ or $j \neq j'$ .

       $E(a_{mi} c_{m'i'j}) = 0$ ; $m,m' = 0,1,2$ .

       $E(a_{mi} e_{i'jk}) = E(c_{mij} e_{i'j'k}) = 0$ ; $m = 0,1,2$ .

The last four lines of (3.20) imply that the $e_{ijk}$, $a_{mi}$, and $c_{mij}$ terms
are uncorrelated with each other.

3.2.2 Estimation, sampling variances, and optimal design. 

Here again, regressions are carried out within the penultimate stage
sampling units, that is, within SSU's. Let $b_{mij}$ be the usual within-SSU
least squares estimate of $B_{mij}$, and let

(3.21) $b_{mi\cdot} = \sum_j b_{mij}/n_i$ ,

       $\hat\beta_m = \sum_i b_{mi\cdot}/n$ , and finally

       $\hat{R} = \hat\beta_2/\hat\beta_1$ .


Using an argument analogous to that used in Section 3.1, it is easy to
see that the regression parameter estimates, $\hat\beta_m$, are unbiased for the
parameters $\beta_m$. Sampling variances and covariances are also derived in a
similar manner,

(3.22) $\mathrm{Var}(b_{1ij}) = \mathrm{Var}_{ij}\, E_{k|ij} (b_{1ij}) + E_{ij}\, \mathrm{Var}_{k|ij} (b_{1ij})$

       $= \mathrm{Var}_{ij} (a_{1i} + c_{1ij}) + E_{ij} [\sigma_e^2 \sigma_{ij22} / n_{ij}(\sigma_{ij11}\sigma_{ij22} - \sigma_{ij12}^2)]$

       $= \sigma_{a11} + \sigma_{c11} + \sigma_e^2 A_{ij22}/n_{ij}$ . Further,

(3.23) $\mathrm{Cov}(b_{1ij}, b_{1ij'}) = \mathrm{Cov}_{ij,ij'} [E_{k|ij}(b_{1ij}),\, E_{k|ij'}(b_{1ij'})]$

       $+ E_{ij,ij'} [\mathrm{Cov}_{k|ij,ij'} (b_{1ij}, b_{1ij'})]$

       $= \mathrm{Cov}_{ij,ij'} (a_{1i} + c_{1ij},\, a_{1i} + c_{1ij'}) + 0 = \sigma_{a11}$ . Therefore,

(3.24) $\mathrm{Var}(\hat\beta_1) = n^{-2} \sum_i n_i^{-2} \sum_j (\sigma_{a11} + \sigma_{c11} + \sigma_e^2 A_{ij22}/n_{ij})$

       $+ n^{-2} \sum_i n_i^{-2} \sum_{j \neq j'} \sigma_{a11}$

       $= (\sigma_{a11} + \sigma_{c11})\, n^{-2} (\sum_i n_i^{-1}) + (\sigma_e^2/n^2) [\sum_i n_i^{-2} (\sum_j A_{ij22}/n_{ij})]$

       $+ \sigma_{a11}\, n^{-2} \sum_i (n_i - 1)/n_i$

       $= (\sigma_{a11}/n) + (\sigma_{c11}/n^2) \sum_i n_i^{-1} + (\sigma_e^2/n^2) [\sum_i n_i^{-2} (\sum_j A_{ij22}/n_{ij})]$ .

In equations (3.22-24), $\sigma_{ijmm'}$ and $A_{ijmm'}$ are defined to be the
variances, covariances and functions of these, for predictor variables
$x_{1ijk}$ and $x_{2ijk}$, calculated within the ijth SSU, analogous to the
definitions of $\sigma_{imm'}$ and $A_{imm'}$ given in equation (3.8).

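Now, if we have a constant PSU and SSU size, $n_i = \bar{n}_\cdot$ and
$n_{ij} = \bar{n}_{i\cdot} = \bar{n}_{\cdot\cdot}$ , and if the design characteristics
$A_{ij22}$ are constant over SSU's at $A_{22}$, then we have,

(3.24) $\mathrm{Var}(\hat\beta_1) = (\sigma_{a11}/n) + (\sigma_{c11}/n\bar{n}_\cdot) + (\sigma_e^2 A_{22}/n\bar{n}_\cdot\bar{n}_{\cdot\cdot})$

       $= (\sigma_{a11}/n) + (\sigma_{c11}/n_\cdot) + (\sigma_e^2 A_{22}/n_{\cdot\cdot})$ ,

where $n_\cdot$ is the total number of SSU's and $n_{\cdot\cdot}$ is the total number
of individuals in the sample. Under similar assumptions,
$\mathrm{Var}(\hat\beta_2)$ and $\mathrm{Cov}(\hat\beta_1,\hat\beta_2)$ are seen to be,

(3.25) $\mathrm{Var}(\hat\beta_2) = (\sigma_{a22}/n) + (\sigma_{c22}/n_\cdot) + (\sigma_e^2 A_{11}/n_{\cdot\cdot})$ , and

(3.26) $\mathrm{Cov}(\hat\beta_1,\hat\beta_2) = (\sigma_{a12}/n) + (\sigma_{c12}/n_\cdot) - (\sigma_e^2 A_{12}/n_{\cdot\cdot})$ .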

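Therefore, for the estimate of the ratio between the two
coefficients, $\hat{R} = \hat\beta_2/\hat\beta_1$ ,

(3.27) $\mathrm{Var}(\hat{R}) \approx \beta_1^{-2} \{\mathrm{Var}(\hat\beta_2) + R^2\, \mathrm{Var}(\hat\beta_1) - 2R\, \mathrm{Cov}(\hat\beta_1,\hat\beta_2)\}$

       $= \beta_1^{-2} \{[(\sigma_{a22}/n) + (\sigma_{c22}/n_\cdot) + (\sigma_e^2 A_{11}/n_{\cdot\cdot})]$

       $+ R^2 [(\sigma_{a11}/n) + (\sigma_{c11}/n_\cdot) + (\sigma_e^2 A_{22}/n_{\cdot\cdot})]$

       $- 2R [(\sigma_{a12}/n) + (\sigma_{c12}/n_\cdot) - (\sigma_e^2 A_{12}/n_{\cdot\cdot})]\}$

       $= \beta_1^{-2} \{(\sigma_{a22} + R^2 \sigma_{a11} - 2R \sigma_{a12})/n$
       $+ (\sigma_{c22} + R^2 \sigma_{c11} - 2R \sigma_{c12})/n_\cdot$
       $+ \sigma_e^2 (A_{11} + R^2 A_{22} + 2R A_{12})/n_{\cdot\cdot}\}$ .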

Using these results, combined with a simple cost model for three-stage
cluster sampling, we can arrive at optimum sampling unit sizes in a
manner similar to that used for Model I. Let,

(3.28) $C = C_0 + C_1 n + C_2 n_\cdot + C_3 n_{\cdot\cdot}$ ,

where $C_0$ is the constant overhead cost of the survey, $C_1$ is the average cost
of including a PSU in the sample, $C_2$ is the average cost of including an SSU
in the sample, and $C_3$ is the average cost of including an individual in the
sample. Then, using the Cauchy-Schwarz inequality, we have the following
condition for optimum allocation for estimating the ratio R:

(3.29) $(\sigma_{a22} + R^2 \sigma_{a11} - 2R \sigma_{a12})^{1/2} / (n\, C_1^{1/2})$

       $= (\sigma_{c22} + R^2 \sigma_{c11} - 2R \sigma_{c12})^{1/2} / (n_\cdot\, C_2^{1/2})$

       $= \sigma_e (A_{11} + R^2 A_{22} + 2R A_{12})^{1/2} / (n_{\cdot\cdot}\, C_3^{1/2})$ .

This relation translates into the following optimum sampling unit sizes:

(3.30) $\bar{n}_\cdot(\mathrm{opt}) = \{(\sigma_{c22} + R^2 \sigma_{c11} - 2R \sigma_{c12})\, C_1 / (\sigma_{a22} + R^2 \sigma_{a11} - 2R \sigma_{a12})\, C_2\}^{1/2}$

(3.31) $\bar{n}_{\cdot\cdot}(\mathrm{opt}) = \sigma_e \{(A_{11} + R^2 A_{22} + 2R A_{12})\, C_2 / (\sigma_{c22} + R^2 \sigma_{c11} - 2R \sigma_{c12})\, C_3\}^{1/2}$ .
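
As a rough illustration, the allocation rule in (3.30) and (3.31) can be evaluated as follows. The Python sketch below uses hypothetical function names and groups the variance terms of (3.27) into three inputs: `v_a` for the PSU-level term $\sigma_{a22} + R^2\sigma_{a11} - 2R\sigma_{a12}$, `v_c` for the SSU-level term $\sigma_{c22} + R^2\sigma_{c11} - 2R\sigma_{c12}$, and `v_e` for $\sigma_e^2(A_{11} + R^2 A_{22} + 2R A_{12})$. The helper that converts a budget into a number of PSU's is an added assumption based on the cost model (3.28) with constant unit sizes.

```python
from math import sqrt


def three_stage_optimum(v_a, v_c, v_e, c1, c2, c3):
    """Optimum SSU's per PSU (3.30) and individuals per SSU (3.31).

    v_a, v_c, v_e : PSU-, SSU- and individual-level variance terms of (3.27)
    c1, c2, c3    : average costs per PSU, per SSU and per individual
    """
    if min(v_a, v_c, v_e, c1, c2, c3) <= 0:
        raise ValueError("all variance terms and costs must be positive")
    ssus_per_psu = sqrt((v_c * c1) / (v_a * c2))
    individuals_per_ssu = sqrt((v_e * c2) / (v_c * c3))
    return ssus_per_psu, individuals_per_ssu


def psus_for_budget(variable_budget, ssus_per_psu, individuals_per_ssu,
                    c1, c2, c3):
    """Number of PSU's affordable for a given variable budget C - C_0."""
    cost_per_psu = (c1 + c2 * ssus_per_psu
                    + c3 * ssus_per_psu * individuals_per_ssu)
    return variable_budget / cost_per_psu
```

In practice the returned values would be rounded to whole units, and the variance terms would themselves be estimates subject to sampling error.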

3.2.3 Estimating variance components. 

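In order to determine necessary sample sizes and optimum allocation,
one needs some idea of cost parameters, design characteristics summarized
by $A_{11}$, $A_{22}$, and $A_{12}$, an approximation of the ratio R and its
denominator $\beta_1$, and the variance components $\sigma_{a11}$, $\sigma_{a22}$,
$\sigma_{a12}$, $\sigma_{c11}$, $\sigma_{c22}$, $\sigma_{c12}$, and $\sigma_e^2$.
These may be obtained from previous survey data using the following
estimates:

(3.32) $\hat\sigma_e^2 = \sum_i\sum_j\sum_k (y_{ijk} - \hat{y}_{ijk})^2/(n_{\cdot\cdot} - 3n_\cdot)$

(3.33) $\hat\sigma_{c11} = \{\sum_i\sum_j (b_{1ij} - b_{1i\cdot})^2 - \hat\sigma_e^2 \sum_i [(n_i - 1)/n_i] \sum_j A_{ij22}/n_{ij}\}/(n_\cdot - n)$ ,
       which if $n_i = \bar{n}_\cdot$ and $n_{ij} = \bar{n}_{\cdot\cdot}$ , further simplifies to,
       $\sum_i\sum_j (b_{1ij} - b_{1i\cdot})^2/(n_\cdot - n) - \hat\sigma_e^2 A_{22}/\bar{n}_{\cdot\cdot}$

(3.34) $\hat\sigma_{c22} = \{\sum_i\sum_j (b_{2ij} - b_{2i\cdot})^2 - \hat\sigma_e^2 \sum_i [(n_i - 1)/n_i] \sum_j A_{ij11}/n_{ij}\}/(n_\cdot - n)$ ,
       which if $n_i = \bar{n}_\cdot$ and $n_{ij} = \bar{n}_{\cdot\cdot}$ , further simplifies to,
       $\sum_i\sum_j (b_{2ij} - b_{2i\cdot})^2/(n_\cdot - n) - \hat\sigma_e^2 A_{11}/\bar{n}_{\cdot\cdot}$

(3.35) $\hat\sigma_{c12} = \{\sum_i\sum_j (b_{1ij} - b_{1i\cdot})(b_{2ij} - b_{2i\cdot})$
       $+ \hat\sigma_e^2 \sum_i [(n_i - 1)/n_i] \sum_j A_{ij12}/n_{ij}\}/(n_\cdot - n)$ ,
       which if $n_i = \bar{n}_\cdot$ and $n_{ij} = \bar{n}_{\cdot\cdot}$ , further simplifies to,
       $\sum_i\sum_j (b_{1ij} - b_{1i\cdot})(b_{2ij} - b_{2i\cdot})/(n_\cdot - n) + \hat\sigma_e^2 A_{12}/\bar{n}_{\cdot\cdot}$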
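(3.36) $\hat\sigma_{a11} = \{\sum_i (b_{1i\cdot} - \hat\beta_1)^2 - \hat\sigma_{c11} [(n-1)/n] \sum_i n_i^{-1}$
       $- \hat\sigma_e^2 [(n-1)/n] \sum_i\sum_j A_{ij22}/n_i^2 n_{ij}\} / (n-1)$ ,
       which if $n_i = \bar{n}_\cdot$ and $n_{ij} = \bar{n}_{\cdot\cdot}$ , further simplifies to,
       $\sum_i (b_{1i\cdot} - \hat\beta_1)^2/(n-1) - \hat\sigma_{c11}/\bar{n}_\cdot - \hat\sigma_e^2 A_{22}/\bar{n}_\cdot\bar{n}_{\cdot\cdot}$ ,

(3.37) $\hat\sigma_{a22} = \{\sum_i (b_{2i\cdot} - \hat\beta_2)^2 - \hat\sigma_{c22} [(n-1)/n] \sum_i n_i^{-1}$
       $- \hat\sigma_e^2 [(n-1)/n] \sum_i\sum_j A_{ij11}/n_i^2 n_{ij}\} / (n-1)$ ,
       which if $n_i = \bar{n}_\cdot$ and $n_{ij} = \bar{n}_{\cdot\cdot}$ , further simplifies to,
       $\sum_i (b_{2i\cdot} - \hat\beta_2)^2/(n-1) - \hat\sigma_{c22}/\bar{n}_\cdot - \hat\sigma_e^2 A_{11}/\bar{n}_\cdot\bar{n}_{\cdot\cdot}$ , and finally,

(3.38) $\hat\sigma_{a12} = \{\sum_i (b_{1i\cdot} - \hat\beta_1)(b_{2i\cdot} - \hat\beta_2) - \hat\sigma_{c12} [(n-1)/n] \sum_i n_i^{-1}$
       $+ \hat\sigma_e^2 [(n-1)/n] \sum_i\sum_j A_{ij12}/n_i^2 n_{ij}\} / (n-1)$ ,
       which if $n_i = \bar{n}_\cdot$ and $n_{ij} = \bar{n}_{\cdot\cdot}$ , further simplifies to,
       $\sum_i (b_{1i\cdot} - \hat\beta_1)(b_{2i\cdot} - \hat\beta_2)/(n-1) - \hat\sigma_{c12}/\bar{n}_\cdot + \hat\sigma_e^2 A_{12}/\bar{n}_\cdot\bar{n}_{\cdot\cdot}$ .

Here the divisors $n_i^2 n_{ij}$ and $\bar{n}_\cdot\bar{n}_{\cdot\cdot}$ follow from the
variance of the PSU mean $b_{1i\cdot}$ implied by (3.22) and (3.23), namely
$\sigma_{a11} + \sigma_{c11}/n_i + \sigma_e^2 n_i^{-2} \sum_j A_{ij22}/n_{ij}$ .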

3.3 Model III: Three-stage design, slope parameters random
at the first stage, error terms nested, individuals within
SSU's within PSU's.

3.3.1 The model and the design. 

As stated earlier, it is not always possible to estimate regression
parameters at the SSU level as required with Model II. For this reason, we
introduce a model-design combination based on the same sampling scheme
as that for Model II, but with a model which allows for variability of slope
parameters among first stage units only.

Design: Select n PSU's; within the ith PSU, select $n_i$ SSU's; and
within the ijth SSU, select $n_{ij}$ individuals.

(3.39) $y_{ijk} = B_{0ij} + B_{1i} x_{1ijk} + B_{2i} x_{2ijk} + e_{ijk}$ , where

       $B_{0ij} = \beta_0 + a_{0i} + c_{0ij}$ ; $j = 1,\ldots,n_i$ ; $i = 1,\ldots,n$ .

       $B_{mi} = \beta_m + a_{mi}$ ; $m = 1,2$ ; $i = 1,\ldots,n$ .

       $E(e_{ijk}) = E(a_{mi}) = E(c_{0ij}) = 0$ ; $m = 0,1,2$ ;
       $k = 1,\ldots,n_{ij}$ ; $j = 1,\ldots,n_i$ ; $i = 1,\ldots,n$ .

       $E(c_{0ij}^2) = \sigma_{c00}$ ,

       $E(a_{mi} a_{m'i}) = \sigma_{amm'}$ ; $m,m' = 0,1,2$ ; $i = 1,\ldots,n$ .

       $E(a_{mi} a_{m'i'}) = 0$ ; $m,m' = 0,1,2$ ; $i \neq i'$ .

       $E(c_{0ij} c_{0i'j'}) = 0$ ; $i \neq i'$ or $j \neq j'$ .

       $E(a_{mi} c_{0i'j}) = 0$ ; $m = 0,1,2$ .

       $E(a_{mi} e_{i'jk}) = E(c_{0ij} e_{i'j'k}) = 0$ ; $m = 0,1,2$ .

3.3.2 The estimates, associated sampling variances, and 
optimal design. 

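Consider the following estimate for the slope parameter
associated with the ith PSU:

(3.40) $b_{1i} = \dfrac{(\sum_j\sum_k x_{2ijk}^2)(\sum_j\sum_k x_{1ijk} y_{ijk}) - (\sum_j\sum_k x_{1ijk} x_{2ijk})(\sum_j\sum_k x_{2ijk} y_{ijk})}{(\sum_j\sum_k x_{1ijk}^2)(\sum_j\sum_k x_{2ijk}^2) - (\sum_j\sum_k x_{1ijk} x_{2ijk})^2}$

       $= \dfrac{(\sum_j\sum_k x_{2ijk}^2)(\sum_j\sum_k x_{1ijk} y_{ijk}) - (\sum_j\sum_k x_{1ijk} x_{2ijk})(\sum_j\sum_k x_{2ijk} y_{ijk})}{d_i}$ ,

where $d_i = (\sum_j\sum_k x_{1ijk}^2)(\sum_j\sum_k x_{2ijk}^2) - (\sum_j\sum_k x_{1ijk} x_{2ijk})^2$

       $= n_{i\cdot}^2 (\sigma_{i11}\sigma_{i22} - \sigma_{i12}^2)$ ,

and $n_{i\cdot} = \sum_j n_{ij}$ is the number of individuals sampled in the ith PSU.
Here, the $x_{mijk}$ are mean corrected within PSU's as for Model I. Then for the
estimate of the overall mean slope, $\beta_1$, consider

(3.41) $\hat\beta_1 = \sum_i b_{1i}/n$ .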

It can be shown, using the following logic, that $\hat\beta_1$, as defined in
equations (3.40) and (3.41) above, is unbiased for the mean slope $\beta_1$.
First, notice that the observations $y_{ijk}$ can be expressed as follows:

(3.42) $y_{ijk} = \beta_0 + a_{0i} + c_{0ij} + (\beta_1 + a_{1i}) x_{1ijk} + (\beta_2 + a_{2i}) x_{2ijk} + e_{ijk}$ .

Now, the conditional expectation of $b_{1i}$ given that the ith PSU is in the
sample is given by,

(3.43) $E_{jk|i}(b_{1i}) = E_{jk|i}\, d_i^{-1} \sum_j\sum_k [(\sum\sum x_{2ijk}^2) x_{1ijk} - (\sum\sum x_{1ijk} x_{2ijk}) x_{2ijk}]\, y_{ijk}$

       $= d_i^{-1} E_{jk|i} \{(\sum\sum x_{2ijk}^2) [\sum\sum x_{1ijk} c_{0ij} + \sum\sum x_{1ijk}^2 (\beta_1 + a_{1i})$
       $+ (\sum\sum x_{1ijk} x_{2ijk})(\beta_2 + a_{2i})]$
       $- (\sum\sum x_{1ijk} x_{2ijk}) [\sum\sum x_{2ijk} c_{0ij} + \sum\sum x_{1ijk} x_{2ijk} (\beta_1 + a_{1i})$
       $+ \sum\sum x_{2ijk}^2 (\beta_2 + a_{2i})]\}$

       $= d_i^{-1} E_{jk|i} \{\sum\sum [(\sum\sum x_{2ijk}^2) x_{1ijk} - (\sum\sum x_{1ijk} x_{2ijk}) x_{2ijk}]\, c_{0ij}$
       $+ [(\sum\sum x_{1ijk}^2)(\sum\sum x_{2ijk}^2) - (\sum\sum x_{1ijk} x_{2ijk})^2] (\beta_1 + a_{1i})\}$

       $= \beta_1 + a_{1i}$ .

Thus, $E(\hat\beta_1) = E_i E_{jk|i} (\hat\beta_1) = \beta_1$ . That is, $\hat\beta_1$ is
unbiased for $\beta_1$. Similarly, it can be shown that $\hat\beta_2$, defined in
an analogous fashion, is unbiased for $\beta_2$.

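Using a similar treatment, sampling variances and covariances can be
derived for these estimates. For $\hat\beta_1$ we have,

(3.44) $\mathrm{Var}(\hat\beta_1) = E_i\, \mathrm{Var}_{jk|i} (\hat\beta_1) + \mathrm{Var}_i\, E_{jk|i} (\hat\beta_1)$ . Now,

(3.45) $\mathrm{Var}_i\, E_{jk|i} (\hat\beta_1) = \mathrm{Var}_i (\sum_i a_{1i}/n) = \sigma_{a11}/n$ , and

(3.46) $\mathrm{Var}_{jk|i} (\hat\beta_1) = \sum_i \mathrm{Var}_{jk|i} (b_{1i}) / n^2$ . Now,

(3.47) $\mathrm{Var}_{jk|i} (b_{1i}) = d_i^{-2}\, \mathrm{Var}_{jk|i} \sum_j\sum_k A_{1ijk}\, y_{ijk}$ , where

(3.48) $A_{1ijk} = (\sum\sum x_{2ijk}^2) x_{1ijk} - (\sum\sum x_{1ijk} x_{2ijk}) x_{2ijk}$

       $= n_{i\cdot} (\sigma_{i22} x_{1ijk} - \sigma_{i12} x_{2ijk})$ . So,

(3.49) $\mathrm{Var}_{jk|i} (b_{1i}) = d_i^{-2} \{\sum_j\sum_k A_{1ijk}^2\, \sigma_e^2 + \sum_j (\sum_k A_{1ijk})^2\, \sigma_{c00}\}$ . Now,

(3.50) $\sum_j\sum_k A_{1ijk}^2 = n_{i\cdot}^2 [\sigma_{i22}^2 \sum\sum x_{1ijk}^2 + \sigma_{i12}^2 \sum\sum x_{2ijk}^2$
       $- 2\sigma_{i22}\sigma_{i12} \sum\sum x_{1ijk} x_{2ijk}]$

       $= n_{i\cdot}^3 (\sigma_{i22}^2 \sigma_{i11} + \sigma_{i12}^2 \sigma_{i22} - 2\sigma_{i22}\sigma_{i12}^2)$

       $= n_{i\cdot}^3 \sigma_{i22} (\sigma_{i11}\sigma_{i22} - \sigma_{i12}^2) = n_{i\cdot}\, \sigma_{i22}\, d_i$ , and

(3.51) $\sum_j (\sum_k A_{1ijk})^2 = n_{i\cdot}^2 \sum_j [\sum_k (\sigma_{i22} x_{1ijk} - \sigma_{i12} x_{2ijk})]^2$

       $= n_{i\cdot}^2 \sum_j n_{ij}^2 (\sigma_{i22}^2 \bar{x}_{1ij\cdot}^2 + \sigma_{i12}^2 \bar{x}_{2ij\cdot}^2$
       $- 2\sigma_{i22}\sigma_{i12}\, \bar{x}_{1ij\cdot} \bar{x}_{2ij\cdot})$ .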

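Now, if we have a constant SSU size, $n_{ij} = \bar{n}_{i\cdot} = n_{i\cdot}/n_i$ , this
simplifies to,

(3.52) $\sum_j (\sum_k A_{1ijk})^2 = n_{i\cdot}^3\, \bar{n}_{i\cdot}\, (\sigma_{i22}^2 \sigma_{i11} \Omega_{i11} + \sigma_{i12}^2 \sigma_{i22} \Omega_{i22} - 2\sigma_{i22}\sigma_{i12}^2\, \Omega_{i12})$ ,

where, for example,

(3.53) $\Omega_{i11} = (\sum_j n_{ij}\, \bar{x}_{1ij\cdot}^2)/(\sum_j\sum_k x_{1ijk}^2)$

is the proportion of within-PSU variance "explained" by SSU's. Now if
$\Omega_{i11} = \Omega_{i22} = \Omega_{i12} = \Omega_i$ , this further simplifies to,

(3.54) $\sum_j (\sum_k A_{1ijk})^2 = n_{i\cdot}^3\, \bar{n}_{i\cdot}\, \Omega_i\, \sigma_{i22} (\sigma_{i22}\sigma_{i11} - \sigma_{i12}^2)$

       $= n_{i\cdot}\, \bar{n}_{i\cdot}\, \Omega_i\, \sigma_{i22}\, d_i$ .

The factor $n_{i\cdot}^3$ arises because, with constant SSU size,
$\sum_j n_{ij}^2 \bar{x}_{1ij\cdot}^2 = \bar{n}_{i\cdot} \Omega_{i11} \sum_j\sum_k x_{1ijk}^2
= \bar{n}_{i\cdot}\, \Omega_{i11}\, n_{i\cdot}\, \sigma_{i11}$ .

So, assuming that the SSU's are of constant size within the ith PSU, i.e. that
$n_{ij} = \bar{n}_{i\cdot} = n_{i\cdot}/n_i$ , and that the proportions of within ith PSU
variances and covariance of the two predictor variables explained by SSU's
are all equal to $\Omega_i$ , we have,

(3.55) $\mathrm{Var}_{jk|i}\, (b_{1i}) = d_i^{-2} (n_{i\cdot}\, \sigma_{i22}\, d_i\, \sigma_e^2 + \sigma_{c00}\, n_{i\cdot}\, \bar{n}_{i\cdot}\, \Omega_i\, \sigma_{i22}\, d_i)$

       $= \{n_{i\cdot}\, \sigma_{i22} (\sigma_e^2 + \bar{n}_{i\cdot}\, \Omega_i\, \sigma_{c00})\}/d_i$ .

Now, if we further assume a constant number of individuals per PSU, and a
constant number of SSU's per PSU, as well as constant within-PSU design
characteristics,

(3.56) $n_{i\cdot} = \sum_i n_{i\cdot}/n = n_{\cdot\cdot}/n$ ,

       $n_i = \sum_i n_i/n = n_\cdot/n = \bar{n}_\cdot$ , so that $\bar{n}_{i\cdot} = n_{\cdot\cdot}/n_\cdot$ , and
       $\Omega_i = \Omega$ and $d_i = d$ , we have,

(3.57) $\mathrm{Var}(\hat\beta_1) = n^{-2} \sum_i [n_{i\cdot}\, \sigma_{i22} (\sigma_e^2 + \bar{n}_{i\cdot}\, \Omega_i\, \sigma_{c00}) / d_i] + \sigma_{a11}/n$

       $= n^{-2} \sum_i [n_{\cdot\cdot}/n] [\sigma_{22} (\sigma_e^2 + \Omega\, \sigma_{c00}\, n_{\cdot\cdot}/n_\cdot) / d] + \sigma_{a11}/n$
       $= (A_{22}/n_{\cdot\cdot}) (\sigma_e^2 + \Omega\, \sigma_{c00}\, n_{\cdot\cdot}/n_\cdot) + \sigma_{a11}/n$
       $= (\sigma_{a11}/n) + A_{22} [(\Omega\, \sigma_{c00}/n_\cdot) + \sigma_e^2/n_{\cdot\cdot}]$ ,

where $A_{22} = \sigma_{22}/(\sigma_{11}\sigma_{22} - \sigma_{12}^2)$ .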

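Similarly for $\hat\beta_2$ and the covariance, we have

(3.58) $\mathrm{Var}(\hat\beta_2) = (\sigma_{a22}/n) + A_{11} [(\Omega\, \sigma_{c00}/n_\cdot) + \sigma_e^2/n_{\cdot\cdot}]$ , and

(3.59) $\mathrm{Cov}(\hat\beta_1,\hat\beta_2) = (\sigma_{a12}/n) - A_{12} [(\Omega\, \sigma_{c00}/n_\cdot) + \sigma_e^2/n_{\cdot\cdot}]$ .

Thus, if we estimate the ratio of the expected values of the two slope
parameters by the ratio of the individual parameter estimates,
$\hat{R} = \hat\beta_2/\hat\beta_1$ , we have,

(3.60) $\mathrm{Var}(\hat{R}) \approx \beta_1^{-2} \{(\sigma_{a22}/n) + A_{11} [(\Omega\, \sigma_{c00}/n_\cdot) + \sigma_e^2/n_{\cdot\cdot}]$

       $+ R^2 \{(\sigma_{a11}/n) + A_{22} [(\Omega\, \sigma_{c00}/n_\cdot) + \sigma_e^2/n_{\cdot\cdot}]\}$

       $- 2R \{(\sigma_{a12}/n) - A_{12} [(\Omega\, \sigma_{c00}/n_\cdot) + \sigma_e^2/n_{\cdot\cdot}]\}\}$

       $= \beta_1^{-2} \{(\sigma_{a22} + R^2 \sigma_{a11} - 2R \sigma_{a12})/n$

       $+ (A_{11} + R^2 A_{22} + 2R A_{12}) (\Omega\, \sigma_{c00}/n_\cdot + \sigma_e^2/n_{\cdot\cdot})\}$ .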

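Using the cost model described in Section 3.2.2, equation (3.28), for
Model II, the Cauchy-Schwarz inequality yields the following optimum unit
sizes for estimating R:

(3.61) $\bar{n}_\cdot(\mathrm{opt}) = \{(A_{11} + R^2 A_{22} + 2R A_{12})\, \Omega\, \sigma_{c00}\, C_1 / (\sigma_{a22} + R^2 \sigma_{a11} - 2R \sigma_{a12})\, C_2\}^{1/2}$ , and

(3.62) $\bar{n}_{\cdot\cdot}(\mathrm{opt}) = (\sigma_e^2/\sigma_{c00}\, \Omega)^{1/2} (C_2/C_3)^{1/2}$ .

As in (3.30) and (3.31), $\bar{n}_\cdot(\mathrm{opt})$ is the number of SSU's per PSU
and $\bar{n}_{\cdot\cdot}(\mathrm{opt})$ is the number of individuals per SSU.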

3.3.3 Estimating variance components. 

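Again, estimates of variances and covariances of the random
parameters are required for determining sample allocations. These can be
obtained as follows:

(3.63) $\hat\sigma_e^2 = \dfrac{\sum_i\sum_j\sum_k (y_{ijk} - \hat{y}_{ijk})^2}{\sum_i\sum_j (n_{ij} - 3)}$

(3.64) $\hat\sigma_{c00} = \dfrac{\sum_i\sum_j (\bar{y}_{ij\cdot} - \bar{y}_{i\cdot\cdot})^2}{\sum_i (n_i - 1)} - \dfrac{\hat\sigma_e^2}{n_{ij}}$

(3.65) $\hat\sigma_{a11} = \dfrac{\sum_i (b_{1i} - \hat\beta_1)^2}{n-1} - \dfrac{A_{22}\, \hat\sigma_e^2}{n_{\cdot\cdot}/n} - \dfrac{\Omega A_{22}\, \hat\sigma_{c00}}{\bar{n}_\cdot}$ ,

(3.66) $\hat\sigma_{a22} = \dfrac{\sum_i (b_{2i} - \hat\beta_2)^2}{n-1} - \dfrac{A_{11}\, \hat\sigma_e^2}{n_{\cdot\cdot}/n} - \dfrac{\Omega A_{11}\, \hat\sigma_{c00}}{\bar{n}_\cdot}$ , and

(3.67) $\hat\sigma_{a12} = \dfrac{\sum_i (b_{1i} - \hat\beta_1)(b_{2i} - \hat\beta_2)}{n-1} + \dfrac{A_{12}\, \hat\sigma_e^2}{n_{\cdot\cdot}/n} + \dfrac{\Omega A_{12}\, \hat\sigma_{c00}}{\bar{n}_\cdot}$ .

In (3.65) to (3.67), the divisor $n_{\cdot\cdot}/n$ is the number of individuals per
PSU and $\bar{n}_\cdot = n_\cdot/n$ is the number of SSU's per PSU. These divisors
are the ones implied by (3.57) to (3.59), since the variance of a single PSU
estimate is n times the variance of the average of the n PSU estimates.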


4. An Example


Fields and Walker (1982) describe a study of the effects of railway
noise on residents based on a social survey of 1453 respondents in 75 areas
in Great Britain. Among other things, estimates of 24 hour $L_{eq}$ dB(A) and the
difference between daytime 15 hour $L_{eq}$ and nighttime $L_{eq}$ were obtained for
each individual surveyed using physical noise measurements. Several
measurements of annoyance were obtained using an interviewer
administered questionnaire. Here, we make use of the four category verbal
scale of annoyance obtained from Question 17 of that study: "Does the noise
of the trains bother or annoy you?" Possible responses ranged from (1) "Not
at all," to (4) "Very much."

For purposes of illustration, we regress this annoyance variable ($y_{ij}$)
on the two independent variables, 24 hour $L_{eq}$ ($x_{1ij}$) and the
daytime-nighttime $L_{eq}$ difference ($x_{2ij}$). An ordinary least squares
regression yields the following parameter estimates:

(4.1)  $b_0$ = 0.0327,    se($b_0$) = 0.132
       $b_1$ = 0.0287,    se($b_1$) = 0.00229
       $b_2$ = -0.00110,  se($b_2$) = 0.00500
       $\hat\sigma_e$ = 0.911

The standard errors in (4.1) are those obtained from the least squares
analysis assuming simple random sampling, and are provided only as rough
indications of sampling variability.

We now analyse these data assuming Model I as described in equation
(3.1). In doing so, regression coefficients are estimated within PSU's, and
then averaged over PSU's. As stated in Section 3, this requires an
acceptable joint distribution of independent variables within PSU's. In
particular, there must be reasonable variation in both $x_{1ij}$ and $x_{2ij}$ and
the two independent variables must not be too highly correlated within PSU's.

The acceptability of these within-PSU design characteristics can be
determined by inspecting the design measures $A_{i11}$ and $A_{i22}$ as described
in equation (3.8) above. After eliminating those PSU's with insufficient
variability in independent variables, we are left with 44 PSU's for Model I
analysis.

Least squares estimates of the regression slopes were calculated 
within each of the 44 PSU’s, and overall estimates of the two slopes were 
obtained as the averages of these PSU estimates as described in equation 
(3.4). By treating each PSU as an independent replicate, estimates of 
standard errors of these estimates can be obtained based on the variability 
of the individual PSU estimates. The results of this analysis are given 
below: 


(4.2)  $\hat\beta_1$ = 0.0527    se($\hat\beta_1$) = 0.0169
       $\hat\beta_2$ = 0.689     se($\hat\beta_2$) = 0.725
       $\hat{R}$ = 13.07     se($\hat{R}$) = 14.57
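
The replicate-based standard errors described above, in which each PSU is treated as an independent replicate, can be sketched as follows. This Python fragment is illustrative only: the function name `replicate_mean_and_se` is hypothetical, and the report does not state the exact formula used, so the usual standard error of a mean of independent replicates is assumed here.

```python
from math import sqrt


def replicate_mean_and_se(psu_estimates):
    """Average of within-PSU estimates, as in (3.4), and its standard error.

    Each PSU estimate is treated as an independent replicate, so the
    standard error is the between-PSU standard deviation divided by the
    square root of the number of PSU's.
    """
    n = len(psu_estimates)
    if n < 2:
        raise ValueError("at least two PSU estimates are required")
    mean = sum(psu_estimates) / n
    sample_variance = sum((b - mean) ** 2 for b in psu_estimates) / (n - 1)
    return mean, sqrt(sample_variance / n)
```

Applied to the 44 within-PSU slope estimates for each predictor, a calculation of this kind would give quantities comparable to those in (4.2). The within-PSU estimates themselves are not reproduced in this report.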

Variance component estimates are obtained by using equations
(3.16-19) as,

(4.3)  $\hat\sigma_e$ = 0.659
       $\hat\sigma_{a11}$ = 0.00819
       $\hat\sigma_{a22}$ = 4.331
       $\hat\sigma_{a12}$ = 0.0331

Using these estimates in the formulas developed in Section 3 for
optimum sample design, equations (3.11) and (3.15), we get the following
estimated optimum cluster sizes for estimating the three parameters,
$\beta_1$, $\beta_2$ and R:

(4.4)  $\bar{n}_\cdot(\mathrm{opt})$ for estimating $\beta_1$ = $4.82\,(C_a/C_b)^{1/2}$

       $\bar{n}_\cdot(\mathrm{opt})$ for estimating $\beta_2$ = $13.8\,(C_a/C_b)^{1/2}$

       $\bar{n}_\cdot(\mathrm{opt})$ for estimating R = $13.8\,(C_a/C_b)^{1/2}$

For example, for estimating the ratio R, optimum cluster sizes for various
ratios of $C_a/C_b$ are given below:

       $C_a/C_b$                      5     10     25     50

       $\bar{n}_\cdot(\mathrm{opt})$              31     44     69     98
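
The tabulated cluster sizes can be checked against the last line of (4.4). The short Python sketch below is only a verification aid; the function name `optimum_cluster_size` is hypothetical, and the coefficient 13.8 is the rounded value reported in (4.4).

```python
from math import sqrt


def optimum_cluster_size(coefficient, cost_ratio):
    """Optimum cluster size of the form coefficient * (C_a / C_b) ** 0.5."""
    return coefficient * sqrt(cost_ratio)


# Rounded optimum cluster sizes for estimating R at the tabulated cost ratios.
table = {ratio: round(optimum_cluster_size(13.8, ratio))
         for ratio in (5, 10, 25, 50)}
```

With these inputs, `table` maps the cost ratios 5, 10, 25 and 50 to 31, 44, 69 and 98, which agrees with the values given above.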


These calculations are only meant to be illustrative, and several 
qualifications concerning these estimates should be made. First, it should 
be noted that the estimate of p 2 , and consequently that of the ratio R, is 
not very precise. It follows that the corresponding variance components and 
hence the optimum cluster sizes are not very precisely determined either. 

In addition, the actual design used in the Fields and Walker study does not 
correspond exactly to that described in Model I. A three-stage design rather 
than a two-stage design was employed. As a result, these calculations are 
relevant only for designs with PSU sizes similar to those observed in the 
study used here. 

5. Summary and Conclusions 

Interview studies of residents' response to noise are often based on 
two-stage sample designs. For these designs, samples of individuals are 
drawn within samples of compact study areas. In a typical survey, such a 
compact study area could consist of a neighborhood or a set of adjacent 
households. If the variability of the noise exposure variables within these 
compact study areas is not large, then the techniques described by Kalton 
(1983) can be used to determine optimal cluster design. On the other hand, 
if there is substantial within area variation in noise exposure, then the 
possibility of variability in the structural relationships (the "true"
regression coefficients) over clusters should be considered. In such
situations, the methods described for Model I in Section 3 can be used to
assist in sample design.

Other noise studies have been based on a more complex design, a
three-stage design. In such a design, samples of individuals are drawn from
samples of compact study areas, which in turn are drawn from a sample of
larger areas. For example, for a multi-airport study, these larger areas
might correspond to cities. If there is little or no variability in noise
exposure within compact study areas, the techniques based on Model III
described in Section 3 can be used to assist in sample design. On the other
hand, if there is substantial variability in noise exposure within compact
study areas, then the methods described for Model II should be employed.
These methods allow for the possibility of variability in the "true"
regression coefficients among compact areas.

The statistical techniques described in this report can be used to 
provide assistance in designing noise surveys. It should also be noted that 
these techniques are more generally useful in a broad range of sample 
survey applications. Indeed, the conclusions regarding multi-stage sample 
design are applicable for any two-variable linear regression model of the 
form given in equation (1.1).